ABSTRACT

We present a method for radical linear compression of datasets where the data are dependent 
on some number M of parameters. We show that, if the noise in the data is independent of 
the parameters, we can form M linear combinations of the data which contain as much infor- 
mation about all the parameters as the entire dataset, in the sense that the Fisher information 
matrices are identical; i.e. the method is lossless. We explore how these compressed numbers 
fare when the noise is dependent on the parameters, and show that the method, although not 
precisely lossless, increases errors by a very modest factor. The method is general, but we
illustrate it with a problem for which it is well-suited: galaxy spectra, whose data typically
consist of $\sim 10^3$ fluxes, and whose properties are set by a handful of parameters such as age,
brightness and a parametrised star formation history. The spectra are reduced to a small
number of data, which are connected to the physical processes entering the problem. This data
compression offers the possibility of a large increase in the speed of determining physical
parameters. This is an important consideration as datasets of galaxy spectra reach $10^6$ in size,
and the complexity of model spectra increases. In addition to this practical advantage, the 
compressed data may offer a classification scheme for galaxy spectra which is based rather 
directly on physical processes. 



1 INTRODUCTION 



There are many instances where objects consist of many data, 
whose values are determined by a small number of parameters. Of- 
ten, it is only these parameters which are of interest. The aim of this 
paper is to find linear combinations of the data which are focussed 
on estimating the physical parameters with as small an error as pos- 
sible. Such a problem is very general, and has been attacked in the 
case of parameter estimation in large-scale structure and the mi- 
crowave background (e.g. Tegmark, Taylor & Heavens 1997 [here- 
after TTH], Bond, Jaffe & Knox 1997, Tegmark 1997a, Tegmark 
1997b). Previous work has concentrated largely on the estimation 
of a single parameter; the main advance of this paper is that it sets 
out a method for the estimation of multiple parameters. The method 
provides one projection per parameter, with the consequent pos- 
sibility of a massive data compression factor. Furthermore, if the 
noise in the data is independent of the parameters, then the method 
is entirely lossless, i.e. the compressed dataset contains as much 
information about the parameters as the full dataset, in the sense 
that the Fisher information matrix is the same for the compressed 
dataset as the entire original dataset. An equivalent statement is that 
the mean likelihood surface is at the peak locally identical when the 
full or compressed data are used. 

We illustrate the method with the case of galaxy spectra, for
which there are surveys underway which will provide $\sim 10^6$
objects. In this application, the noise is generally not independent of
the parameters, as there is a photon shot-noise component which
depends on how many photons are expected. We take a spectrum
with poor signal-to-noise, whose noise is approximately from
photon counting alone, and investigate how the method fares. In this
case, the method is not lossless, but the increase in error bars is
shown to be minimal, and superior in this respect to an alternative
compression system, PCA (Principal Component Analysis).

One advantage such radical compression offers is speed of 
analysis. One major scientific goal of galaxy spectral surveys is 
to determine physical parameters of the stellar component of the 
galaxies, such as the age, star formation history, initial mass func- 
tion and so on. Such a process can, in principle, be achieved by gen- 
erating model galaxy spectra by stellar population synthesis tech- 
niques, and finding the best-fitting model by maximum-likelihood 
techniques. This can be very time-consuming, and must inevitably 
be automated for so many galaxies. In addition, one may have a 
large parameter space to explore, so any method which can speed 
up this process is worth investigation. One possible further applica- 
tion of the data compression method is that the handful of numbers 
might provide the basis of a classification scheme which is based 
on the physical properties one wants to measure. 

The outline of the paper is as follows: in section 2 we set out
the lossless compression method for noise which is independent of
the parameters; the proof appears in the appendix. In section 3
we discuss the more general case where the noise covariance
matrix and the mean signal both depend on the parameters. In section
4 we show through a worked example of galaxy spectra that the
method, although not lossless, works very well in the general case.
2 METHOD

We represent our data by a vector $x_i$, $i = 1, \ldots, N$ (e.g. a set
of fluxes at different wavelengths). These measurements include a
signal part, which we denote by $\boldsymbol{\mu}$, and noise, $\mathbf{n}$:

$$\mathbf{x} = \boldsymbol{\mu} + \mathbf{n} \qquad (1)$$

Assuming the noise has zero mean, $\langle\mathbf{x}\rangle = \boldsymbol{\mu}$. The signal will
depend on a set of parameters $\{\theta_\alpha\}$, which we wish to determine. For
galaxy spectra, the parameters may be, for example, age, magnitude
of source, metallicity and some parameters describing the star
formation history. Thus, $\boldsymbol{\mu}$ is a noise-free spectrum of a galaxy with
certain age, metallicity etc.

The noise properties are described by the noise covariance
matrix, $\mathbf{C}$, with components $C_{ij} = \langle n_i n_j\rangle$. If the noise is gaussian,
the statistical properties of the data are determined entirely by $\boldsymbol{\mu}$
and $\mathbf{C}$. In principle, the noise can also depend on the parameters.
For example, in galaxy spectra, one component of the noise will 
come from photon counting statistics, and the contribution of this 
to the noise will depend on the mean number of photons expected 
from the source. 

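The aim is to derive the parameters from the data. If we
assume uniform priors for the parameters, then the a posteriori
probability for the parameters is the likelihood, which for gaussian noise
is

$$\mathcal{L}(\theta_\alpha) = \frac{1}{(2\pi)^{N/2}\sqrt{\det(\mathbf{C})}} \exp\left[-\frac{1}{2}\sum_{i,j}(x_i-\mu_i)\,C^{-1}_{ij}\,(x_j-\mu_j)\right] \qquad (2)$$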



One approach is simply to find the (highest) peak in the likelihood,
by exploring all parameter space, and using all $N$ pixels. The
position of the peak gives estimates of the parameters which are
asymptotically (low noise) the best unbiased estimators (see TTH). This is
therefore the best we can do. The maximum-likelihood procedure
can, however, be time-consuming if $N$ is large, and the
parameter space is large. The aim of this paper is to see whether we can
reduce the $N$ numbers to a smaller number, without increasing the
uncertainties on the derived parameters $\theta_\alpha$. To be specific, we try to
find a number $N' < N$ of linear combinations of the spectral data
$\mathbf{x}$ which encompass as much as possible of the information about
the physical parameters. We find that this can be done losslessly
in some circumstances; the spectra can be reduced to a handful of
numbers without loss of information. The speed-up in parameter
estimation is a factor of about 100.
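The brute-force approach described above can be sketched as follows. This is only an illustration under stated assumptions: the straight-line `model_mean`, the noise level and the parameter grid are hypothetical stand-ins, not the stellar population models used in this paper.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, C):
    """ln L of equation (2) for data x, model mean mu and noise covariance C."""
    r = x - mu
    sign, logdet = np.linalg.slogdet(C)          # log of det(C), numerically stable
    chi2 = r @ np.linalg.solve(C, r)             # r^t C^{-1} r without forming C^{-1}
    return -0.5 * (chi2 + logdet + x.size * np.log(2.0 * np.pi))

# Hypothetical toy model (an assumption): a straight line with slope theta.
wavelength = np.linspace(0.0, 1.0, 50)

def model_mean(theta):
    return 1.0 + theta * wavelength

rng = np.random.default_rng(0)
C = 0.01 * np.eye(wavelength.size)               # parameter-independent noise
x = model_mean(2.0) + rng.multivariate_normal(np.zeros(wavelength.size), C)

# Explore the whole (one-dimensional) parameter space using all N pixels.
grid = np.linspace(0.0, 4.0, 401)
lnL = np.array([gaussian_log_likelihood(x, model_mean(t), C) for t in grid])
theta_ml = grid[np.argmax(lnL)]
```

Every grid point costs a full $N$-pixel evaluation, which is the expense the compression is meant to reduce.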

In general, reducing the dataset in this way will lead to larger 
error bars in the parameters. To assess how well the compression is 
doing, consider the behaviour of the (logarithm of the) likelihood 
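function near the peak. Performing a Taylor expansion and
truncating at the second-order terms,

$$\ln\mathcal{L} = \ln\mathcal{L}_{\rm peak} + \frac{1}{2}\frac{\partial^2\ln\mathcal{L}}{\partial\theta_\alpha\,\partial\theta_\beta}\,\delta\theta_\alpha\,\delta\theta_\beta \qquad (3)$$

Truncating here assumes that the likelihood surface itself is
adequately approximated by a gaussian everywhere, not just at the
maximum-likelihood point. The actual likelihood surface will vary
when different data are used; on average, though, the width is set
by the (inverse of the) Fisher information matrix:

$$F_{\alpha\beta} \equiv -\left\langle\frac{\partial^2\ln\mathcal{L}}{\partial\theta_\alpha\,\partial\theta_\beta}\right\rangle \qquad (4)$$

where the average is over an ensemble with the same parameters
but different noise.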

For a single parameter, the Fisher matrix $\mathbf{F}$ is a scalar $F$, and
the error on the parameter can be no smaller than $F^{-1/2}$. If the
data depend on more than one parameter, and all the parameters
have to be estimated from the data, then the error is larger. The
error on one parameter $\alpha$ (marginalised over the others) is at least
$[(\mathbf{F}^{-1})_{\alpha\alpha}]^{1/2}$ (Kendall & Stuart 1969). There is a little more
discussion of the Fisher matrix in Tegmark, Taylor & Heavens (1997),
hereafter TTH. The Fisher matrix depends on the signal and noise
terms in the following way (TTH, equation 15):

$$F_{\alpha\beta} = \frac{1}{2}{\rm Tr}\left[\mathbf{C}^{-1}\mathbf{C}_{,\alpha}\mathbf{C}^{-1}\mathbf{C}_{,\beta} + \mathbf{C}^{-1}\left(\boldsymbol{\mu}_{,\alpha}\boldsymbol{\mu}^t_{,\beta} + \boldsymbol{\mu}_{,\beta}\boldsymbol{\mu}^t_{,\alpha}\right)\right] \qquad (5)$$

where the comma indicates derivative with respect to the parameter. 
If we use the full dataset x, then this Fisher matrix represents the 
best that can possibly be done via likelihood methods with the data. 
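A direct numerical transcription of equation (5) may help make the two terms concrete. The sketch below is an assumption-laden illustration: the function name `fisher_matrix` and the three-pixel example are hypothetical. The example takes photon-counting noise, $\mathbf{C} = {\rm diag}(\boldsymbol{\mu})$, with a single amplitude parameter so that $\mathbf{C}_{,\alpha} = {\rm diag}(\boldsymbol{\mu})$ and $\boldsymbol{\mu}_{,\alpha} = \boldsymbol{\mu}$, the case considered in section 4.2.

```python
import numpy as np

def fisher_matrix(C, dC, dmu):
    """Equation (5).

    C   : (N, N) noise covariance
    dC  : sequence of M arrays (N, N), the derivatives C_,alpha
    dmu : (M, N) array, row alpha holds mu_,alpha
    """
    Cinv = np.linalg.inv(C)
    M = len(dmu)
    F = np.zeros((M, M))
    for a in range(M):
        for b in range(M):
            noise_term = Cinv @ dC[a] @ Cinv @ dC[b]
            mean_term = Cinv @ (np.outer(dmu[a], dmu[b]) + np.outer(dmu[b], dmu[a]))
            F[a, b] = 0.5 * np.trace(noise_term + mean_term)
    return F

# Made-up expected photon counts in three pixels.
mu = np.array([4.0, 9.0, 7.0])
F_amp = fisher_matrix(np.diag(mu), [np.diag(mu)], np.array([mu]))
# For this noise model the single element is sum(mu) + N/2 = 20 + 1.5.
```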

In practice, some of the data may tell us very little about the 
parameters, either through being very noisy, or through having no 
sensitivity to the parameters. So in principle we may be able to 
throw some data away without losing very much information about 
the parameters. Rather than throwing individual data away, we can 
do better by forming linear combinations of the data, and then 
throwing away the combinations which tell us least. To proceed, 
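we first consider a single linear combination of the data:

$$y \equiv \mathbf{b}^t\mathbf{x} \qquad (6)$$

for some weighting vector $\mathbf{b}$ ($t$ indicates transpose). We will try to
find a weighting which captures as much information as possible about a
particular parameter, $\theta_\alpha$. If we assume we know all the other
parameters, this amounts to maximising $F_{\alpha\alpha}$. The dataset (now consisting
of a single number) has a Fisher matrix, which is given in TTH
(equation 25) by:

$$F_{\alpha\beta} = \frac{1}{2}\left(\frac{\mathbf{b}^t\mathbf{C}_{,\alpha}\mathbf{b}}{\mathbf{b}^t\mathbf{C}\mathbf{b}}\right)\left(\frac{\mathbf{b}^t\mathbf{C}_{,\beta}\mathbf{b}}{\mathbf{b}^t\mathbf{C}\mathbf{b}}\right) + \frac{(\mathbf{b}^t\boldsymbol{\mu}_{,\alpha})(\mathbf{b}^t\boldsymbol{\mu}_{,\beta})}{\mathbf{b}^t\mathbf{C}\mathbf{b}} \qquad (7)$$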



Note that the denominators are simply numbers. It is clear from this 
expression that if we multiply b by a constant, we get the same F. 
This makes sense: multiplying the data by a constant factor does 
not change the information content. We can therefore fix the
normalisation of $\mathbf{b}$ at our convenience. To simplify the denominators,
we therefore maximise $F_{\alpha\alpha}$ subject to the constraint

$$\mathbf{b}^t\mathbf{C}\mathbf{b} = 1 \qquad (8)$$



The most general problem has both the mean $\boldsymbol{\mu}$ and the covariance
matrix $\mathbf{C}$ depending on the parameters of the spectrum, and
the resulting maximisation leads to an eigenvalue problem which is
nonlinear in $\mathbf{b}$. We are unable to solve this, so we consider a case for
which an analytic solution can be found. TTH showed how to solve
for the case of estimation of a single parameter in two special cases:
1) when $\boldsymbol{\mu}$ is known, and 2) when $\mathbf{C}$ is known (i.e. doesn't depend
on the parameters). We will concentrate on the latter case, but gen- 
eralise to the problem of estimating many parameters at once. For 
a single parameter, TTH showed that the entire dataset could be 
reduced to a single number, with no loss of information about the 
parameter. We show below that, if we have M parameters to es- 
timate, then we can reduce the dataset to M numbers. These M 
numbers contain just as much information as the original dataset; 
i.e. the data compression is lossless. 

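We consider the parameters in turn. With $\mathbf{C}$ independent of
the parameters, $\mathbf{F}$ simplifies, and maximising $F_{11}$ subject to the
constraint requires

$$\frac{\partial}{\partial b_i}\left(b_j\,\mu_{,1\,j}\,b_k\,\mu_{,1\,k} - \lambda\,b_j C_{jk} b_k\right) = 0 \qquad (9)$$

where $\lambda$ is a Lagrange multiplier, and we assume the summation
convention ($j, k \in [1, N]$). This leads to

$$\boldsymbol{\mu}_{,1}\,(\mathbf{b}^t\boldsymbol{\mu}_{,1}) = \lambda\,\mathbf{C}\mathbf{b} \qquad (10)$$

with solution, properly normalised,

$$\mathbf{b}_1 = \frac{\mathbf{C}^{-1}\boldsymbol{\mu}_{,1}}{\sqrt{\boldsymbol{\mu}^t_{,1}\mathbf{C}^{-1}\boldsymbol{\mu}_{,1}}} \qquad (11)$$

and our compressed datum is the single number $y_1 = \mathbf{b}_1^t\mathbf{x}$. This
solution makes sense: ignoring the unimportant denominator, the
method weights high those data which are parameter-sensitive, and
low those data which are noisy.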
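As a small numerical illustration of equation (11), the sketch below builds the weighting vector for a three-pixel dataset. The covariance, the derivative of the mean and the data vector are made-up numbers chosen so the arithmetic is easy to follow; `single_parameter_weights` is a hypothetical helper name.

```python
import numpy as np

def single_parameter_weights(C, dmu):
    """Equation (11): b_1 = C^{-1} mu_,1 / sqrt(mu_,1^t C^{-1} mu_,1)."""
    Cinv_dmu = np.linalg.solve(C, dmu)
    return Cinv_dmu / np.sqrt(dmu @ Cinv_dmu)

# Made-up example: the third pixel is quiet, the second is noisy.
C = np.diag([1.0, 4.0, 0.25])
dmu = np.array([1.0, 2.0, 0.5])       # sensitivity of each pixel to the parameter
b1 = single_parameter_weights(C, dmu)

x = np.array([3.0, 1.0, 2.0])         # one made-up data vector
y1 = b1 @ x                           # the single compressed datum
```

Here `C^{-1} mu_,1 = (1, 0.5, 2)`, so the quiet third pixel receives the largest weight even though its sensitivity is the smallest, and the normalisation of equation (8) holds by construction.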

To see whether the compression is lossless, we compare the
Fisher matrix element before and after the compression.
Substitution of $\mathbf{b}_1$ into (7) gives

$$F_{11} = \boldsymbol{\mu}^t_{,1}\,\mathbf{C}^{-1}\boldsymbol{\mu}_{,1} \qquad (12)$$

which is identical to the Fisher matrix element using the full data
(equation 5) if $\mathbf{C}$ is independent of $\theta_1$. Hence, as claimed by TTH,
the compression from the entire dataset to the single number $y_1$
loses no information about 9i. For example, if ^ (x 6, then yi = 
Si ■'^'Z Mi and is simply an estimate of the parameter itself. 
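The equality of equations (12) and (5) for parameter-independent noise can be checked numerically. The following is a sketch with a randomly generated positive-definite covariance and a random $\boldsymbol{\mu}_{,1}$ (both assumptions made purely for the test); it also compares against an arbitrary weighting vector, which by the Cauchy-Schwarz inequality cannot carry more information.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)            # arbitrary positive-definite covariance
dmu = rng.normal(size=N)               # arbitrary mu_,1

# Full-data Fisher element: equation (5) with C_,1 = 0.
F_full = dmu @ np.linalg.solve(C, dmu)

# Optimal weights, equation (11), and the Fisher element of y_1 from equation (7).
b1 = np.linalg.solve(C, dmu) / np.sqrt(F_full)
F_compressed = (b1 @ dmu) ** 2 / (b1 @ C @ b1)

# Any other single linear combination does no better.
b_random = rng.normal(size=N)
F_random = (b_random @ dmu) ** 2 / (b_random @ C @ b_random)
```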



2.0.1 Fiducial model 

It is important to note that $y_1$ contains as much information about
$\theta_1$ as the full dataset only if all other parameters are known, and
also provided that the covariance matrix and the derivative of the mean
in (11) are those at the maximum likelihood point. We turn to the first
of these restrictions in the next section, and discuss the second one here.

In practice, one does not know beforehand what the true
solution is, so one has to make an initial guess for the parameters. This
guess we refer to as the fiducial model. We compute the covariance
matrix $\mathbf{C}$ and the gradient of the mean ($\boldsymbol{\mu}_{,1}$) for this fiducial
model, to construct $\mathbf{b}_1$. The Fisher matrix for the compressed
datum is (12), but with the fiducial values inserted. In general this is
not the same as the Fisher matrix at the true solution. In practice one
can iterate: choose a fiducial model; use it to estimate the
parameters, and then repeat, using the estimated parameters
as the fiducial model. As our example in section 4 shows, such
iteration may be completely unnecessary.



2.1 Estimation of many parameters 

The problem of estimating a single parameter from a set of data is 
unusual in practice. Normally one has several parameters to esti- 
mate simultaneously, and this introduces substantial complications 
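into the analysis. How can we generalise the single-parameter
estimate above to the case of many parameters? We proceed by finding
a second number $y_2 \equiv \mathbf{b}_2^t\mathbf{x}$ by the following requirements:

• $y_2$ is uncorrelated with $y_1$. This demands that $\mathbf{b}_2^t\mathbf{C}\mathbf{b}_1 = 0$.

• $y_2$ captures as much information as possible about the second
parameter $\theta_2$.

This requires two Lagrange multipliers (we normalise $\mathbf{b}_2$ by
demanding that $\mathbf{b}_2^t\mathbf{C}\mathbf{b}_2 = 1$ as before). Maximising and applying
the constraints gives the solution

$$\mathbf{b}_2 = \frac{\mathbf{C}^{-1}\boldsymbol{\mu}_{,2} - (\boldsymbol{\mu}^t_{,2}\mathbf{b}_1)\,\mathbf{b}_1}{\sqrt{\boldsymbol{\mu}^t_{,2}\mathbf{C}^{-1}\boldsymbol{\mu}_{,2} - (\boldsymbol{\mu}^t_{,2}\mathbf{b}_1)^2}} \qquad (13)$$

This is readily generalised to any number $M$ of parameters. There
are then $M$ orthogonal vectors $\mathbf{b}_m$, $m = 1, \ldots, M$, each $y_m$
capturing as much information as possible about parameter $\theta_m$ which is
not already contained in $y_q$, $q < m$. The constrained maximisation gives

$$\mathbf{b}_m = \frac{\mathbf{C}^{-1}\boldsymbol{\mu}_{,m} - \sum_{q=1}^{m-1}(\boldsymbol{\mu}^t_{,m}\mathbf{b}_q)\,\mathbf{b}_q}{\sqrt{\boldsymbol{\mu}^t_{,m}\mathbf{C}^{-1}\boldsymbol{\mu}_{,m} - \sum_{q=1}^{m-1}(\boldsymbol{\mu}^t_{,m}\mathbf{b}_q)^2}} \qquad (14)$$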



This procedure is analogous to Gram-Schmidt orthogonalisation
with a curved metric, with $\mathbf{C}$ playing the role of the metric tensor.
Note that the procedure gives precisely $M$ eigenvectors and hence
$M$ numbers, so the dataset has been compressed from the original
$N$ data down to the number of parameters $M$.
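The Gram-Schmidt construction of equation (14) can be sketched in a few lines. This is a minimal illustration, assuming the derivatives $\boldsymbol{\mu}_{,m}$ are available as rows of an array evaluated at the fiducial model; the function name `compression_vectors` and the random test inputs are hypothetical.

```python
import numpy as np

def compression_vectors(C, dmu):
    """Equation (14). dmu has shape (M, N); returns B of shape (M, N) with rows b_m."""
    M, N = dmu.shape
    B = np.zeros((M, N))
    for m in range(M):
        Cinv_dmu = np.linalg.solve(C, dmu[m])
        proj = B[:m] @ dmu[m]                    # (mu_,m^t b_q) for all q < m
        v = Cinv_dmu - proj @ B[:m]              # remove what earlier y_q already capture
        norm2 = dmu[m] @ Cinv_dmu - np.sum(proj ** 2)
        B[m] = v / np.sqrt(norm2)
    return B

# Made-up inputs: a random positive-definite covariance and random derivatives.
rng = np.random.default_rng(4)
N, M = 6, 3
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)
dmu = rng.normal(size=(M, N))

B = compression_vectors(C, dmu)
gram = B @ C @ B.T                               # should be the M x M identity
```

The check `B @ C @ B.T` being the identity expresses both constraints at once: unit variance for each $y_m$ and zero correlation between them, with $\mathbf{C}$ acting as the metric.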

Since, by construction, the numbers $y_m$ are uncorrelated, the
likelihood of the parameters is obtained by multiplication of the
likelihoods obtained from each statistic $y_m$. The $y_m$ have mean
$\langle y_m\rangle = \mathbf{b}_m^t\boldsymbol{\mu}$ and unit variance, so the likelihood from the
compressed data is simply

$$\ln\mathcal{L}(\theta_\alpha) = {\rm constant} - \sum_{m=1}^{M}\frac{(y_m - \langle y_m\rangle)^2}{2} \qquad (15)$$



and the Fisher matrix of the combined numbers is just the sum of
the individual Fisher matrices. Note once again the role of the
fiducial model in setting the weightings $\mathbf{b}_m$: the orthonormality of the
new numbers only holds if the fiducial model is correct.
Multiplication of the likelihoods is thus only approximately correct, but
iteration could be used if desired.
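Equation (15) is cheap to evaluate because it involves only $M$ numbers per model. The sketch below uses a deliberately trivial, hypothetical setup (unit noise covariance and a mean $\boldsymbol{\mu} = (\theta_1, \theta_2, 1)$, so that equation (14) gives the weighting vectors by inspection) to show the compressed likelihood being maximised over a parameter grid.

```python
import numpy as np

def compressed_log_likelihood(y, B, mu_model):
    """Equation (15), up to the additive constant."""
    return -0.5 * np.sum((y - B @ mu_model) ** 2)

# Hypothetical toy: C is the identity and mu = (theta_1, theta_2, 1), so
# mu_,1 = (1, 0, 0) and mu_,2 = (0, 1, 0), and equation (14) returns these rows.
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

def model_mean(theta):
    return np.array([theta[0], theta[1], 1.0])

x = np.array([2.1, -0.9, 1.3])        # one made-up noisy data vector (N = 3)
y = B @ x                             # the M = 2 compressed data

candidates = [(t1, t2)
              for t1 in np.linspace(0.0, 4.0, 41)
              for t2 in np.linspace(-3.0, 1.0, 41)]
best = max(candidates, key=lambda th: compressed_log_likelihood(y, B, model_mean(th)))
```

In this toy case the third pixel carries no information about either parameter, so discarding it costs nothing.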



2.1.1 Proof that the method can be lossless for many parameters 

Under the assumption that the covariance matrix is independent of
the parameters, reduction of the original data to the $M$ numbers $y_m$
results in no loss of information about the $M$ parameters at all. In
fact the set $\{y_m\}$ produces, on average, a likelihood surface which
is locally identical to that from the entire dataset - no information
about the parameters is lost in the compression process. With the
restriction that the information is defined locally by the Fisher
matrix, the set $\{y_m\}$ is a set of sufficient statistics for the parameters
$\{\theta_\alpha\}$ (e.g. Koch 1999). A proof of this for an arbitrary number of
parameters is given in the appendix.
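The many-parameter result can also be checked numerically. The sketch below does not follow equation (14) literally: as a swapped-in but algebraically equivalent route (an assumption of this example, not the procedure stated in the text), it whitens the derivatives with a Cholesky factor of $\mathbf{C}$ and orthonormalises them with a QR decomposition, which reproduces the $\mathbf{b}_m$ up to an overall sign. All inputs are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 40, 3
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)            # parameter-independent covariance
dmu = rng.normal(size=(M, N))          # rows are mu_,alpha

# Full-data Fisher matrix: equation (5) with all C_,alpha = 0.
F_full = dmu @ np.linalg.solve(C, dmu.T)

# Whitening plus QR: equivalent to Gram-Schmidt with C as the metric.
L = np.linalg.cholesky(C)              # C = L L^t
W = np.linalg.solve(L, dmu.T)          # whitened derivatives, shape (N, M)
Q, R = np.linalg.qr(W)                 # orthonormal columns spanning the same space
B = np.linalg.solve(L.T, Q).T          # rows b_m = L^{-t} q_m

# The y_m are uncorrelated with unit variance, so their Fisher matrix is the
# sum over m of products of the derivatives of the means <y_m> = b_m^t mu.
mean_derivs = B @ dmu.T                # element (m, alpha) = b_m^t mu_,alpha
F_compressed = mean_derivs.T @ mean_derivs
```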



3 THE GENERAL CASE 

In general, the covariance matrix does depend on the parameters, 
and this is the case for galaxy spectra, where at least one
component of the noise is parameter-dependent. This is the photon
counting noise, for which $C_{ii} = \mu_i$. TTH argued that it is better to treat
this case by using the $n$ eigenvectors which arise from assuming
the mean is known, rather than the single number (for one
parameter) which arises if we assume that the covariance matrix is known,
as above. We find that, on the contrary, the small number of
eigenvectors $\mathbf{b}_m$ allow a much greater degree of compression than the
known-mean eigenvectors (which in this case are simply
individual pixels, ordered by $|\mu_{,\alpha}/\mu|$). For data signal-to-noise of around
2, the latter allow a data compression by about a factor of 2
before the errors on the parameters increase substantially, whereas the
method here allows drastic compression from thousands of
numbers to a handful. To show what can be achieved, we use a set of
simulated galaxy spectra to constrain a few parameters
characterising the galaxy star formation history.



3.1 Parameter Eigenvectors 

In the case when the covariance matrix is independent of the
parameters, it does not matter which parameter we choose to form $y_1$,
$y_2$, etc, as the likelihood surface from the compressed numbers is,
on average, locally identical to that from the full dataset. However, 
in the general case, the procedure does lose information, and the 
amount of information lost could depend on the order of assign- 
ment of parameters to m. If the parameter estimates are correlated, 
as we will see in Fig. ^, the error in both parameters is dominated 
by the length of the likelihood contours along the 'ridge'. It makes 
sense then to diagonalise the matrix of second derivatives of $\ln\mathcal{L}$ at
the fiducial model, and use these as the parameters (temporarily), 
as proposed by Ballinger et al. 2000 for galaxy surveys. The pa- 
rameter eigenvalues would order the importance of the parameter 
combinations to the likelihood. The procedure would be to take the 
smallest eigenvalue (with eigenvector lying along the ridge), and 
make the likelihood surface as narrow as possible in that direction. 
One then repeats along the parameter eigenvectors in increasing or- 
der of eigenvalue. 

Specifically, diagonalise $F_{\alpha\beta}$ in (5), to form a diagonal
covariance matrix $\boldsymbol{\Lambda} = \mathbf{S}^t\mathbf{F}\mathbf{S}$. The orthogonal parameter combinations
are $\mathbf{S}^t\boldsymbol{\theta}$, where $\mathbf{S}$ has the normalised eigenvectors of $\mathbf{F}$ as its
columns. The weighting vectors $\mathbf{b}_m$ are then computed from (14)
by replacing /x^^ by SprH^^^. 
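One way to read the diagonalisation step is sketched below. It assumes the rotated parameters are $\mathbf{S}^t\boldsymbol{\theta}$, so that by the chain rule the derivatives of the mean with respect to them are the rows of $\mathbf{S}^t$ applied to the original derivatives; that reading, the unit noise covariance and the numbers in `dmu` are all assumptions of this example.

```python
import numpy as np

# Made-up derivatives of the mean for two strongly correlated parameters.
dmu = np.array([[1.0, 0.5, 0.2, 0.9],
                [0.9, 0.6, 0.1, 1.0]])
C = np.eye(4)                                  # parameter-independent noise

F = dmu @ np.linalg.solve(C, dmu.T)            # Fisher matrix, equation (5)
eigvals, S = np.linalg.eigh(F)                 # ascending eigenvalues; columns are eigenvectors
Lam = S.T @ F @ S                              # diagonal by construction

# Derivatives with respect to the rotated combinations S^t theta.
dmu_rotated = S.T @ dmu
F_rotated = dmu_rotated @ np.linalg.solve(C, dmu_rotated.T)
```

Because `eigh` returns eigenvalues in increasing order, the first row of `dmu_rotated` corresponds to the direction along the ridge, which is the one the text proposes to treat first.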



4 A WORKED EXAMPLE: GALAXY SPECTRA 

We start by investigating a two-parameter model. We have run a
grid of stellar evolution models, with a burst of star formation at
time $-t$, where $t = 0$ is the present day. The star formation rate
is ${\rm SFR}(t') = A\,\delta(t' + t)$ where $\delta$ is a Dirac delta function. The
two parameters to determine are age $t$ and normalisation $A$. Fig. 1
shows some spectra with fixed normalisation ($1\,M_\odot$ of stars
produced) and different age. There are $n = 352$ pixels between 300
and 1000 nm. Real data will be more complicated (variable
transmission, instrumental noise etc) but this system is sufficiently
complex to test the methods in essential respects. For simplicity, we
assume that the noise is gaussian, with a variance given by the mean,
$\mathbf{C} = {\rm diag}(\mu_1, \mu_2, \ldots)$. This is appropriate for photon number counts
when the number is large. We assume the same behaviour, even
with small numbers, for illustration, but there is no reason why
a more complicated noise model cannot be treated. It should be
stressed that this is a more severe test of the model than a
typical galaxy spectrum, where the noise is likely to be dominated by
sources independent of the galaxy, such as CCD read-out noise or
sky background counts. In the latter case, the compression method
will do even better than the example here.

The simulated galaxy spectrum is one of the galaxy
spectra (age 3.95 Gyr, model number 100), and the maximum
signal-to-noise per bin is taken to be 2. Noise is added, approximately
photon noise, with a gaussian distribution with variance equal to
the number of photons in each channel (Fig. 1). Hence
$\mathbf{C} = {\rm diag}(\mu_1, \mu_2, \ldots)$.

Figure 1. Top panel: example model spectra, with age increasing
downwards. Bottom panel: simulated galaxy spectrum (including noise), whose
properties we wish to determine, superimposed on noise-free spectrum of
galaxy with the same age. The horizontal axis is wavelength in Angstroms.

Figure 2. Full likelihood solution using all pixels. There are 6 contours
running down from the peak value in steps of 0.5 (in $\ln\mathcal{L}$), and 3 outer
contours at $-100$, $-1000$ and $-10000$. The triangle in the upper-right
corner marks the fiducial model which determines the eigenvectors $\mathbf{b}_{1,2}$.

The most probable values for the age and normalisation (assuming
uniform priors) are given by maximising the likelihood:

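$$\mathcal{L}({\rm age}, {\rm norm}) = \frac{1}{(2\pi)^{n/2}\sqrt{\prod_i \mu_i}} \exp\left[-\frac{1}{2}\sum_i \frac{(x_i - \mu_i)^2}{\mu_i}\right] \qquad (16)$$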



where $\boldsymbol{\mu}$ depends on age and normalisation. $\ln\mathcal{L}$ is shown in Fig.
2. Since this uses all the data, and all the approximations hold, this
is the best that can be done, given the S/N of the spectrum. To
solve the eigenvalue problem for $\mathbf{b}$ requires an initial guess for the
spectrum. This 'fiducial model' was chosen to have an age of 8.98
Gyr, i.e. very different from the true solution (model number 150
rather than 100). This allows us to compute the eigenvector $\mathbf{b}_1$ from
(11). This gives the single number $y_1 = \mathbf{b}_1^t\mathbf{x}$. With this as the
datum, the likelihood for age and normalisation is



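$$\mathcal{L}({\rm age}, {\rm norm}) = \frac{1}{\sqrt{2\pi\,\mathbf{b}_1^t\mathbf{C}\mathbf{b}_1}} \exp\left[-\frac{(y_1 - \langle y_1\rangle)^2}{2\,\mathbf{b}_1^t\mathbf{C}\mathbf{b}_1}\right] \qquad (17)$$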



where $\langle y_1\rangle = \mathbf{b}_1^t\boldsymbol{\mu}$. Note that the mean and covariance matrix
here depend on the parameters - i.e. they are not from the fiducial
model. The resultant likelihood is shown in Fig. 4. Clearly it does
less well than the full solution, but it does constrain the
parameters to a narrow ridge, on which the true solution (age model = 100,
log(normalisation) = 0) lies. The second eigenvector $\mathbf{b}_2$ is
obtained by taking the normalisation as the second parameter. The
vector is shown in the lower panel of Fig. 3. The normalisation
parameter is rather a special case, which results in $\mathbf{b}_2$ differing
from $\mathbf{b}_1$ only by a constant offset in the weights (for this
parameter $\boldsymbol{\mu}_{,\alpha} = \boldsymbol{\mu}$ and so $\mathbf{C}^{-1}\boldsymbol{\mu}_{,\alpha} = (1, 1, \ldots, 1)^t$). The likelihood
for the parameters with $y_2$ as the single datum is shown in Fig. 5.
On its own, it does not tightly constrain the parameters, but when
combined with $y_1$, it does remarkably well (Fig. 6).

Figure 3. Eigenvectors $-\mathbf{b}_1$ (age) and $-\mathbf{b}_2$ (normalisation). Wavelength
$\lambda$ is in Angstroms. Note that the weights in $\mathbf{b}_1$ are negative, which is why
the sign has been changed for plotting: it is the blue (left) end of the spectrum
which is weighted most heavily for $y_1$. This is expected as this part of the
spectrum changes most rapidly with age. Note that these weightings
differ by a constant; this feature is special to the amplitude parameter, and is
explained in the text.

Figure 4. Likelihood solution for the age datum $y_1$. Contours are as in Fig. 2.

4.1 Three-parameter estimation 

We complicate the situation now to a 3-parameter star-formation
rate ${\rm SFR}(t) = A\exp(-t/\tau)$, and estimate $A$, $t$ and $\tau$. Chemical
evolution is included by using a simple closed-box model (with
instantaneous recycling; Pagel 1997). This affects the depths of the
absorption lines. If we follow the same procedure as before,
choosing $(t, A, \tau)$ as the order for computing $\mathbf{b}_1$, $\mathbf{b}_2$ and $\mathbf{b}_3$, then the
product of the likelihoods from $y_1$, $y_2$ and $y_3$ is as shown in the
right panel of Fig. 7. The left panel shows the likelihood from the
full dataset of 1000 numbers, which does little better than the 3
compressed numbers. It is interesting to explore how the parameter
eigenvector method fares in this case. Here we follow the
procedure in section 3.1, and maximise the curvature along the ridge first.
The resulting three numbers constrain the parameters as in the
middle panel; in this case there is no apparent improvement over using
eigenvectors from $(t, A, \tau)$, but it may be advantageous in other
applications.

Figure 5. Likelihood solution for the normalisation datum $y_2$. Contours
are as in Fig. 2.

Figure 6. Likelihood solution for the age datum $y_1$ and the normalisation
datum $y_2$. Contours are as in Fig. 2.

Figure 7. (Left) Likelihood solution for the full dataset of 1000 numbers
for a single galaxy, as a function of $t$/Gyr, $\tau$/Gyr and amplitude.
(Middle) Likelihood for 3 compressed numbers, from parameter eigenvectors.
(Right) Likelihood surface from 3 compressed numbers (age, normalisation
and $\tau$ eigenvectors). All contours shown are 3.13 below the peak in $\ln\mathcal{L}$;
the irregularities in the surface are artefacts of the surface-drawing routine.

4.2 Estimate of increase in errors

For the noise model we have adopted, we can readily compute the
increase in the conditional error for one of the parameters - the
normalisation of the spectrum. This serves as an illustration of how
much information is lost in the compression process. In this case,
$\mathbf{C} = {\rm diag}(\boldsymbol{\mu})$ and $\mathbf{C}_{,\alpha} = {\rm diag}(\boldsymbol{\mu}_{,\alpha}) = {\rm diag}(\boldsymbol{\mu})$, and the Fisher matrix (a single
element) can be written in terms of the total number of photons
and the number of spectral pixels. From (5), $F = N_{\rm photons} +
N_{\rm pixels}/2$. The compressed data, on the other hand, have a Fisher
matrix $F = N_{\rm photons} + 1/2$, so the error bar on the normalisation
is increased by a factor

$${\rm Fractional\ error\ increase} \simeq \sqrt{1 + \frac{1}{2s}} \qquad (18)$$

for $N_{\rm photons} \gg 1$, where $s \equiv N_{\rm photons}/N_{\rm pixels}$ is the average
number of photons per pixel. Even if $s$ is as low as 2, we see that the
error bar is increased only by around 12%.
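The arithmetic behind equation (18) is easy to reproduce. The sketch below compares the exact ratio of error bars, obtained from the two Fisher elements quoted above, with the large-photon-number approximation; the function names and the choice of 352 pixels with $s = 2$ are illustrative assumptions.

```python
import numpy as np

def error_increase(n_photons, n_pixels):
    """Ratio of error bars: sqrt(F_full / F_compressed) for the normalisation."""
    F_full = n_photons + n_pixels / 2.0
    F_compressed = n_photons + 0.5
    return np.sqrt(F_full / F_compressed)

def error_increase_approx(s):
    """Equation (18), valid when the total photon number is large."""
    return np.sqrt(1.0 + 1.0 / (2.0 * s))

n_pixels = 352
s = 2.0                                   # average photons per pixel
exact = error_increase(s * n_pixels, n_pixels)
approx = error_increase_approx(s)         # sqrt(1.25), i.e. about a 12% increase
```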



4.3 Computational issues 

We have reduced the likelihood problem in this case by a factor
of more than a hundred. The eigenproblem is trivial to solve. The
work to be done is in reducing a whole suite of model spectra to $M$
numbers, by forming scalar products of them with the vectors
$\mathbf{b}_m$. This is a one-shot task, and trivial in comparison with the job
of generating the models.
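The one-shot reduction of the model suite amounts to a single matrix product. In the sketch below the model grid and the weighting vectors are random placeholders (assumptions made only to show shapes and cost; real $\mathbf{b}_m$ would come from equation (14)), so the quantity maximised is a squared distance in the compressed space rather than a properly normalised likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, N, M = 500, 352, 2

# Placeholders: a stand-in model grid and stand-in weighting vectors.
model_spectra = rng.uniform(1.0, 3.0, size=(n_models, N))
B = rng.normal(size=(M, N))

# One-shot task: every model spectrum is reduced to M numbers.
compressed_models = model_spectra @ B.T            # shape (n_models, M)

# Per galaxy: compress the data once, then compare M numbers per model.
x = model_spectra[123]                             # noise-free "data" for the check
y = B @ x
lnL = -0.5 * np.sum((compressed_models - y) ** 2, axis=1)
best = int(np.argmax(lnL))
```

After the grid has been compressed, each galaxy costs one product of size $M \times N$ plus $M$ subtractions per model, instead of $N$ per model.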



4.4 Role of fiducial model 

The fiducial model sets the weightings $\mathbf{b}_m$. After this step, the
likelihood analysis is correct for each $y_m$, even if the fiducial model is
wrong. The only place where there is an approximation is in the
multiplication of the likelihoods from all $y_m$ to estimate finally
the parameters. The $y_m$ are strictly only uncorrelated if the
fiducial model coincides with the true model. This approximation can
be dropped, if desired, by computing the correlations of the $y_m$
for each model tested. We have explored how the fiducial model
affects the recovered parameters, and an example result from the
two-parameter problem is shown in Fig. 8. Here the ages and
normalisations of a set of 'true' galaxies with S/N ≲ 2 are estimated,
using a common (9 Gyr) galaxy as the fiducial model. We see that
the method is successful at recovering the age, even if the
fiducial model is very badly wrong. There are errors, of course, but the
important aspect is whether the compressed data do significantly
worse than the full dataset of 352 numbers. Fig. 8 shows that this is
not the case.

Although it appears from this example to be unnecessary, if
one wants to improve the solution, then it is permissible to
iterate, using the first estimate as the fiducial model. This adds to the
computational task, but not significantly; assuming the first
iteration gives a reasonable parameter estimate, then one does not have
to explore the entire parameter space in subsequent iterations.

Figure 8. The effect of the fiducial model on recovery of the parameters.
Here a single fiducial model is chosen (with age 9 Gyr), and ages recovered
from many true galaxy spectra with ages between zero and 14 Gyr. Top left
panel shows the recovered age from the two numbers $y_1$ and $y_2$ (with age
and normalisation weightings), plotted against the true model age. Top right
shows how well the full dataset (with S/N ≲ 2) can recover the parameters.
The lower panel shows the estimated age from the $y_1$ and $y_2$ plotted against
the age recovered from the full dataset, showing that the compression adds
very little to the error, even if the fiducial model is very wrong. Note also
that the scatter increases with age; old galaxies are more difficult to date
accurately.



5 COMPARISON WITH PRINCIPAL COMPONENT 
ANALYSIS 

It is interesting to compare with other data compression and
parameter estimation methods. For example, Principal Component
Analysis is another linear method (e.g. Murtagh & Heck 1987,
Francis et al. 1992, Connolly et al. 1995, Folkes, Lahav & Maddox
1996, Sodre & Cuevas 1996, Galaz & de Lapparent 1998,
Bromley et al. 1998, Glazebrook, Offer & Deeley 1998, Singh, Gulati &
Gupta 1998, Connolly & Szalay 1999, Ronen, Aragon-Salamanca
& Lahav 1999, Folkes et al. 1999), which projects the data onto
eigenvectors of the covariance matrix, which is determined
empirically from the scatter between flux measurements of different
galaxies. Part of the covariance matrix in PCA is therefore
determined by differences in the models, whereas in our case $\mathbf{C}$ refers
to the noise alone. PCA then finds uncorrelated projections which
contribute in decreasing amounts to the variance between galaxies
in the sample.

Figure 9. The first two principal component eigenvectors, from a system
of model spectra consisting of a burst at different times.

Figure 10. Likelihood solution for the first two principal components, PC1
(top) and PC2. Contours are as in Fig. 2.

Figure 11. First principal component (PC1) and $y_1$ versus age. One in
every 10 models was used to do the PCA. In the presence of noise, at a
level of S/N ≲ 2 per bin, $y_1$ is almost monotonic with age, whereas PC1,
although correlated with age, is not a good age estimator.

One finds that the first principal component is correlated with
the galaxy age (Ronen, Aragon-Salamanca & Lahav 1999). Figure
9 shows the PCA eigenvectors obtained from a set of 20 burst
model galaxies which differ only in age, and Figure 10 shows the
resultant likelihood from the first two principal components. In the
language of this paper, the principal components are correlated, so
the 2x2 covariance matrix is used to determine the likelihood. We
see that the components do not do nearly as well as the parameter
eigenvectors; they do about as well as $y_1$ on its own. For interest,
we plot the first principal component and $y_1$ vs. age in Figure 11.
In the presence of noise (S/N ≲ 2 per bin), $y_1$ is almost
monotonic with age, whereas PC1 is not. Since PCA is not optimised for
parameter estimation, it is not lossless, and it should be no surprise
that it fares less well than the tailored eigenfunctions of section 3.
If one cannot model the effect of the parameters a priori, then this
method cannot be used, whereas PCA might still be an effective
tool.
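For comparison, a minimal PCA of a set of model spectra can be sketched as below. The spectra here are a hypothetical one-parameter family of smooth power laws standing in for burst models (an assumption; they are not stellar population spectra), and the principal components are obtained from a singular value decomposition of the mean-subtracted sample, which is equivalent to diagonalising the empirical covariance matrix between galaxies.

```python
import numpy as np

wavelength = np.linspace(300.0, 1000.0, 352)       # nm, same pixel count as the example
ages = np.linspace(1.0, 14.0, 20)

# Hypothetical stand-in for 20 burst models: spectra that redden with age.
spectra = np.array([(wavelength / 300.0) ** (a / 7.0 - 1.0) for a in ages])

centred = spectra - spectra.mean(axis=0)           # remove the mean spectrum
U, sing, Vt = np.linalg.svd(centred, full_matrices=False)

components = Vt[:2]                                # first two principal eigenvectors
pcs = centred @ components.T                       # PC1 and PC2 for each model
variance_fraction = sing ** 2 / np.sum(sing ** 2)  # share of sample variance per component
```

Note the contrast with the method of this paper: nothing in this construction refers to the noise covariance or to derivatives of the mean with respect to the parameters; the eigenvectors are set entirely by the scatter within the sample.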



6 DISCUSSION 

We have presented a linear data compression algorithm for esti- 
mation of multiple parameters from an arbitrary dataset. If there 
are M parameters, the method reduces the data to a compressed 
dataset with M members. In the case where the noise is indepen- 
dent of the parameters, the compression is lossless; i.e. the M data 
contain as much information about the parameters as the entire 
dataset. Specifically, this means the mean likelihood surface around 
the peak is locally identical whichever of the full or compressed 
dataset is used as the data. It is worth emphasising the power of 
this method: it is well known that, in the low-noise limit, the maximum 
likelihood parameter estimates are the best unbiased estimates. 
Hence if we do as well with the compressed dataset as with the full 
dataset, there is no other method, linear or otherwise, which can 
improve upon our results. The method can result in a massive com- 
pression, with the degree of compression given by the ratio of the 



size of the dataset to the number of parameters. Parameter estima- 
tion is speeded up by the same factor. 

Although the method is lossless in certain circumstances, we 
believe that the data compression can still be very effective when 
the noise does depend on the model parameters. We have illus- 
trated this using simulated galaxy spectra as the data, where the 
noise comes from photon counting (in practice, other sources of 
noise will also be present, and possibly dominant); we find that 
the algorithm is still almost lossless, with errors on the parameters 
increasing typically by a factor $\sim\sqrt{1 + 1/(2s)}$, where $s$ is the 
average number of photons per spectral channel. The example we 
have chosen is a more severe test of the algorithm than real galaxy 
spectra; in reality the noise may well be dominated by factors exter- 
nal to the galaxy, such as detector read-out noise, sky background 
counts (for ground-based measurements) or zodiacal light counts 
(for space telescopes). In this case, the noise is indeed independent 
of the galaxy parameters, and the method is lossless. 
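
As a rough numerical illustration of the factor $\sqrt{1 + 1/(2s)}$ quoted above, the following minimal sketch evaluates it for a few photon counts per channel. The function name `error_inflation` and the sample values of s are illustrative assumptions, not taken from the analysis in this paper.

```python
import math

def error_inflation(s):
    """Approximate factor by which parameter errors grow, sqrt(1 + 1/(2s)),
    for an average of s photons per spectral channel (photon-counting noise)."""
    if s <= 0:
        raise ValueError("s must be positive")
    return math.sqrt(1.0 + 1.0 / (2.0 * s))

# Illustrative photon counts per channel (hypothetical values).
inflation = {s: error_inflation(s) for s in (2, 10, 100, 1000)}
for s, factor in inflation.items():
    print(f"s = {s:5d}: errors larger by a factor {factor:.4f}")
```

Under this approximation the increase is already below 3 per cent at s = 10, and it shrinks further as the photon count grows.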

The compression method requires prior choice of a fiducial 
model, which determines the projection vectors b. The choice of 
fiducial model will not bias the solution, and the likelihood given 
the $y_m$ individually can be computed without approximation. Combining 
the likelihoods from the individual $y_m$ by multiplication is 
approximate, as their independence is only guaranteed if the fidu- 
cial model is correct. However, in our examples, we find that the 
method correctly recovers the true solution, even if the fiducial 
model is very different. If one is cautious, one could always iterate. 
There are circumstances where the choice of a good fiducial model 
may be more important, if the eigenvectors depend very sensitively 
on the model parameters. An example of this is the determination 
of the redshift z of the galaxy, whose observed wavelengths are 
increased by a factor 1 + z by the expansion of the Universe. If 
the main signal for z comes from spectral lines, then the method 
will give great weight to certain discrete wavelengths, determined 
by the fiducial z. If the true redshift is different, these wavelengths 
will not coincide with the spectral lines. It should be stressed that 
the method will still allow an estimate of the parameters, includ- 
ing z, but the error bars will not be optimal. This may be one case 
where applying the method iteratively may be of great value. 
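
The effect of a poorly chosen fiducial model on the error bars can be sketched numerically. The example below is a toy illustration under stated assumptions: the mean model `amp * exp(-rate * x)`, the noise level, the parameter values and all function names are hypothetical and are not the galaxy models used in this paper. It builds the projection vectors from a fiducial model, then evaluates the Fisher matrix of the compressed data at the true model, for parameter-independent noise.

```python
import numpy as np

def compression_vectors(C, dmu):
    """Projection vectors b_1..b_M for noise covariance C (N x N) and mean
    derivatives dmu (M x N), assuming C does not depend on the parameters."""
    Cinv = np.linalg.inv(C)
    B = np.zeros_like(dmu, dtype=float)
    for m in range(dmu.shape[0]):
        proj = B[:m] @ dmu[m]                  # mu_{,m}^t b_q for q < m
        vec = Cinv @ dmu[m] - proj @ B[:m]     # remove overlap with earlier b_q
        norm2 = dmu[m] @ Cinv @ dmu[m] - np.sum(proj**2)
        B[m] = vec / np.sqrt(norm2)            # normalise so that b_m^t C b_m = 1
    return B

def mean_derivatives(theta, x):
    """Derivatives of the toy mean amp * exp(-rate * x) w.r.t. (amp, rate)."""
    amp, rate = theta
    decay = np.exp(-rate * x)
    return np.vstack([decay, -amp * x * decay])

x = np.linspace(0.0, 5.0, 200)
C = 0.05**2 * np.eye(x.size)       # parameter-independent noise (assumed)
theta_true = (2.0, 1.5)            # hypothetical true parameters
D_true = mean_derivatives(theta_true, x)

def marginal_errors(theta_fid):
    """Marginal errors from the compressed data when b is built at theta_fid."""
    B = compression_vectors(C, mean_derivatives(theta_fid, x))
    J = B @ D_true.T               # J[m, a] = b_m^t mu_{,a} at the true model
    return np.sqrt(np.diag(np.linalg.inv(J.T @ J)))

full = np.sqrt(np.diag(np.linalg.inv(D_true @ np.linalg.inv(C) @ D_true.T)))
far = marginal_errors((1.0, 0.5))  # fiducial far from the truth
matched = marginal_errors(theta_true)
print("error ratio, distant fiducial:", far / full)
print("error ratio, matched fiducial:", matched / full)
```

In this toy setting a distant fiducial model inflates the error bars, while a fiducial equal to the true model recovers the full-data errors, which is the motivation for iterating: the estimate from one pass can serve as the fiducial model for the next.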

We have compared the parameter estimation method with another 
linear compression algorithm, Principal Component Analysis. 
PCA is not lossless unless all principal components are used, 
and compares unfavourably in this respect for parameter estimation. 
However, one requires a theoretical model for the methods 
in this paper; PCA does not require one, needing instead a repre- 
sentative ensemble for effective use. Other, more ad hoc, schemes 
consider particular features in the spectrum, such as broad-band 
colours, or equivalent widths of lines (Worthey 1994). Each of these 
is a ratio of linear projections, with weightings given by the filter 
response or sharp filters concentrated at the line. There may well be 
merit in the way the weightings are constructed, but they will not 
in general do as well as the optimum weightings presented here. It 
is worth remarking on the ability of the method to separate param- 
eters such as age and metallicity, which often appear degenerately 
in some methods. In the 'external noise' case, provided the degeneracy 
can be lifted by maximum likelihood methods using every 
pixel in the spectrum, it can also be lifted by using the reduced 
data. Of course, if the modelling is not adequate to estimate the 
parameters using all the data, then compression is not going to help 
at all, and one needs to think again. For example, a complication 
which may arise in a real galaxy spectrum is the presence of fea- 
tures not in the model, such as emission lines from hot gas. These 
can be included if the model is extended by inclusion of extra pa- 
rameters. This problem exists whether the full or compressed data 
are used. Of course, we can use standard goodness-of-fit tests to de- 
termine whether the data are consistent with the model as specified, 
or whether more parameters are required. 

The data compression to a handful of numbers offers the possi- 
bility of a classification scheme for galaxy spectra. This is attractive 
as the numbers are connected closely with the physical processes 
which determine the spectrum, and will be explored in a later pa- 
per. An additional realistic aim is to determine the star formation 
history of each individual galaxy, without making specific assump- 
tions about the form of the star formation rate. The method in this 
paper provides the means to achieve this. 
Appendix 

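In this appendix, we prove that the linear compression algorithm 
for estimation of an arbitrary number M of parameters 
is lossless, provided the noise is independent of the parameters, 
$C_{,\alpha} = 0$. Specifically, loss-free means the Fisher matrix for the set 
of M numbers $y_m = b_m^t x$ is identical to the Fisher matrix of the 
original dataset x: 

$$F^{\rm O}_{\alpha\beta} = \langle\alpha|\beta\rangle, \qquad (19)$$

where $\langle\alpha|\beta\rangle \equiv \mu_{,\alpha}^t C^{-1} \mu_{,\beta}$ is the Fisher matrix of the full 
dataset when $C_{,\alpha} = 0$. 

By construction, the $y_m$ are uncorrelated, so the likelihoods multiply 
and the Fisher matrix for the set $\{y_m\}$ is the sum of the derivatives 
of the log-likelihoods from the individual $y_m$: 

$$F_{\alpha\beta} = \sum_m F_{\alpha\beta}(m). \qquad (20)$$

From the Fisher matrix of a single projection with $C_{,\alpha} = 0$, 

$$F_{\alpha\beta}(m) = (b_m^t \mu_{,\alpha})(b_m^t \mu_{,\beta}). \qquad (21)$$

With (14), we can write 

$$b_m^t \mu_{,\alpha} = \frac{\langle\alpha|m\rangle - \sum_{q=1}^{m-1} F_{\alpha m}(q)}{\sqrt{\langle m|m\rangle - \sum_{q=1}^{m-1} F_{mm}(q)}}. \qquad (22)$$

Hence 

$$F_{\alpha\beta}(m) = \frac{\left[\langle\alpha|m\rangle - \sum_{q=1}^{m-1} F_{\alpha m}(q)\right]\left[\langle\beta|m\rangle - \sum_{q=1}^{m-1} F_{\beta m}(q)\right]}{\langle m|m\rangle - \sum_{q=1}^{m-1} F_{mm}(q)}. \qquad (23)$$

Consider first $\beta = m$: 

$$F_{\alpha m}(m) = \langle\alpha|m\rangle - \sum_{q=1}^{m-1} F_{\alpha m}(q) \quad\Rightarrow\quad F_{\alpha M} = \sum_{q=1}^{M} F_{\alpha M}(q) = \langle\alpha|M\rangle, \qquad (24)$$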

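proving that these terms are unchanged after compression. We 
therefore need to consider $F_{\alpha\beta}(m)$ for $\alpha$ or $\beta < m$. First we note 
that 

$$F_{\alpha\beta}(m) = \frac{F_{\alpha m}(m)\, F_{m\beta}(m)}{F_{mm}(m)} \qquad (25)$$

and, from (24), 

$$\sum_{q=1}^{\beta} F_{\alpha\beta}(q) = \langle\alpha|\beta\rangle. \qquad (26)$$

We want the sum to extend to M. However, the terms from $\beta + 1$ 
to M are all zero. This can be shown as follows: (25) shows that it 
is sufficient to show that $F_{\alpha m}(m) = 0$ if $m > \alpha$. Setting $\beta = m$ 
in (26) gives $\sum_{q=1}^{m} F_{\alpha m}(q) = \langle\alpha|m\rangle$, while reversing $\alpha$ and $m$ 
gives $\sum_{q=1}^{\alpha} F_{\alpha m}(q) = \langle\alpha|m\rangle$. Subtracting, we get 

$$\sum_{q=\alpha+1}^{m} F_{\alpha m}(q) = 0. \qquad (27)$$

Now, the contribution from q does not depend on derivatives with respect to 
higher-numbered parameters, so we can evaluate $F_{\alpha m}(\alpha + 1)$ by 
setting $m = \alpha + 1$. The sum (27) implies that this term is zero. 
Increasing m successively by one up to M, and using (25), proves 
that all the terms are zero, proving that the compression is lossless. 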
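
The result can also be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: the noise covariance and the derivative vectors are random stand-ins, and the function and variable names are hypothetical. It builds the projection vectors by successive orthogonalisation, sums the Fisher contributions of the individual $y_m$, and compares the total with $\mu_{,\alpha}^t C^{-1} \mu_{,\beta}$ for the full dataset.

```python
import numpy as np

def compression_vectors(C, dmu):
    """Projection vectors b_1..b_M for noise covariance C (N x N) and mean
    derivatives dmu (M x N), assuming C does not depend on the parameters."""
    Cinv = np.linalg.inv(C)
    B = np.zeros_like(dmu, dtype=float)
    for m in range(dmu.shape[0]):
        proj = B[:m] @ dmu[m]                  # mu_{,m}^t b_q for q < m
        vec = Cinv @ dmu[m] - proj @ B[:m]     # remove overlap with earlier b_q
        norm2 = dmu[m] @ Cinv @ dmu[m] - np.sum(proj**2)
        B[m] = vec / np.sqrt(norm2)            # normalise so that b_m^t C b_m = 1
    return B

rng = np.random.default_rng(0)
N, M = 50, 3                                   # data size and parameter count
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)                    # arbitrary positive-definite covariance
dmu = rng.normal(size=(M, N))                  # stand-in derivatives mu_{,alpha}

B = compression_vectors(C, dmu)
F_full = dmu @ np.linalg.inv(C) @ dmu.T        # Fisher matrix of the full dataset
P = B @ dmu.T                                  # P[m, alpha] = b_m^t mu_{,alpha}
F_terms = np.einsum("ma,mb->mab", P, P)        # contribution F_{alpha beta}(m) of each y_m
F_comp = F_terms.sum(axis=0)                   # Fisher matrix of the compressed data

print("max |F_comp - F_full| =", np.abs(F_comp - F_full).max())
print("max |b_m^t mu_{,alpha}| for m > alpha =", np.abs(np.tril(P, -1)).max())
```

The second printed quantity checks the vanishing terms used in the proof: the projection $b_m^t \mu_{,\alpha}$ should be zero to rounding error whenever $m > \alpha$, so only the first $\beta$ terms contribute to $F_{\alpha\beta}$.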